From caasi@ucselx.sdsu.edu Wed Oct 3 20:12:20 1990
Return-Path: <caasi@ucselx.sdsu.edu>
From: caasi@ucselx.sdsu.edu (richard)
Subject: Basics of the TEI, part 1: design goals
To: bzs@world.std.com
Date: Fri, 31 Aug 90 8:04:54 PDT
X-Mailer: ELM [version 2.2 PL0]
Date: Fri, 17 Aug 90 10:46:03 CDT
Comments: "ACH / ACL / ALLC Text Encoding Initiative"
From: Michael Sperberg-McQueen 312 996-2477 -2981 <U35395@UICVM.uic.edu>
Subject: Basics of the TEI, part 1: design goals
This list has had a lot of recent subscriptions in response to the
announcement that the TEI Guidelines are now available in draft form;
TEI-L now goes to over 275 addresses. The 600 pre-printed copies of the
draft, which we originally thought might be a bit too many to get rid of
in the year before version 2 is ready, may at this rate all be spoken
for before the month of August is out.
We're happy about all the interest, because it suggests that many
others agree with the organizers of the TEI that we need methods for
text encoding that suit multiple uses of the same texts, support the
exchange of texts among researchers and other interested parties, serve
languages other than English and scripts other than Latin, and work
with all kinds of text, not only the most common.
This list should play a big role in the revision of the Guidelines,
and to help get the relevant discussion started, it might be a good idea
for the editors to discuss from time to time some of the background to
the current draft -- a sort of TEI tutorial over the net. This will, we
hope, provoke some questions from participants in the list, and will
lead over time to discussions of the many thorny technical and other
issues involved with a project like this. Much of what we say at the
beginning may seem (or be) basic and uncontroversial, and those who like
fireworks may wish we would jump right to the burning questions and get
some arguments going. It appears, though, that some of the
uncontroversial basics are essential even to understanding some of the
trickier burning questions, so we are going to go slowly at first.
Anyone who wants to start a second thread on any burning issue of their
choice may do so.
We count on the many participants in this list who are serving on the
TEI working committees to jump in and amplify or supplement our account
wherever they see fit.
WHO IS THE TEI FOR?
Let's start with something fairly simple: who is the TEI for and
what are the basic goals?
The goals of the TEI are to define a format for encoding texts in a
linear data stream which is suitable for the interchange of textual
material between researchers, and to provide concrete recommendations,
for those who can use them, as to what features of texts should usually
be recorded. As the letterhead puts it, the TEI is an "Initiative for
Text Encoding Guidelines and a Common Interchange Format for Literary
and Linguistic Data". Note some non-obvious points:
1. The TEI came out of the community of those using computers to do
research on or with texts, and they are our primary constituency.
That is: literary scholars, linguists, computational linguists,
historians, philosophers, theologians, philologists, people work-
ing on machine translation, ... you name it. The publishing
industry, database vendors, software developers, and others with
commercial interests in electronic text are interested in the TEI,
and many are sharing their expertise with us, but they are not the
*primary* constituency. If research and publishing were to turn
out to require different things, the TEI would go with the needs
of researchers.
It's important to note that this is mostly an imaginary issue:
so far the requirements of all these groups seem astonishingly
close to identical. Very concretely: I have not encountered a
single problem faced by humanists that does not have an analogue
in a problem faced by linguists, and another in a problem faced by
publishers or commercial database vendors. And vice versa.
Sometimes the problems look different, but so far most differences
have proven superficial. We believe that what will work for
researchers must work for other applications as well. So in a
real sense, though researchers are the primary constituency, the
real intended constituency is everyone who works with electronic
text in *any* way, and wants to be able (a) to move the text from
system to system without information loss, or (b) to use the text
for more than one thing.
2. One major intended use for the Guidelines is as a specification
for an interchange format. Transfers between researchers,
machines, programs, and networks would use such a format very
simply: as a description of what my text will look like when it
passes from my hands to yours, or of what I would like your text
to look like when it reaches me. An interchange format does not
tell anyone what to encode, any more than the ASCII code tells us
how to write novels or manuals. What is encoded is the
intellectual responsibility of the researcher; no one can take
that responsibility away.
3. The other major intended use is as a guide for those encoding
texts for general use (and one hopes that that includes most of
those encoding texts). The Guidelines should provide a sample set
of textual features that many people have found useful in textual
work, together with ways of encoding those features. No one is
required to encode all those textual features, but the list should
(if we do our work right) be taken seriously as a checklist of
what the community as a whole tends to find useful.
Software developers should also benefit from the guidelines in both
these ways: as a definition of an export-import format (or as an inter-
nal file format, if you wish!) *and* as a checklist of textual features
commonly thought important. I suppose many of us have seen software
which suffered from its makers' sometimes unconsciously narrow concep-
tion of the kinds of texts it would be used for -- the Guidelines should
be useful as a sort of brain-storming, concept-broadening tool for
developers.
1.1 Basic requirements
The basic requirements for a text encoding scheme have been stated in
the NEH proposals for TEI funding. (Quick tip of the hat to the NEH,
the EEC, and the Mellon Foundation for their funding. Without them, it
wouldn't be happening nearly as fast.)
An encoding scheme is any (systematic?) method of representing or
encoding textual data in machine-readable form. Typically, an encoding
scheme must include the following (a rough sketch in code follows the
list):
1. methods for recording the characters in the text (including dia-
critics, special symbols, non-Roman alphabets, etc.)
2. conventions for rendering a text in a single linear sequence
(specifying how footnotes, end-notes, critical apparatus, parallel
texts, and other non-linear complications are handled)
3. methods for recording logical divisions of texts (e.g. book, chap-
ter, paragraph; act, scene, speech, line; ...)
4. methods for recording analytic information like literary or lin-
guistic analysis
5. conventions for delimiting in-line comments and other ancillary
material
6. conventions for identifying the text being encoded and those
responsible for encoding it
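To make those six points a bit more concrete, here is a small, purely
illustrative sketch in Python. The tag names and data structures are
invented for the example -- they are not the TEI or SGML tag set -- but
they show one way the six kinds of information might be recorded and
then flattened into a single linear stream.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Note:
    # A footnote or similar item pulled out of the running text (point 2).
    anchor: str   # the word or point in the text the note attaches to
    text: str


@dataclass
class Division:
    # A logical division of the text, e.g. a chapter or a speech (point 3).
    kind: str
    number: int
    content: str              # the characters of the text itself (point 1)
    analysis: str = ""        # literary or linguistic analysis (point 4)
    notes: List[Note] = field(default_factory=list)


@dataclass
class EncodedText:
    title: str                # identification of the text (point 6) ...
    encoder: str              # ... and of the person who encoded it
    comments: List[str] = field(default_factory=list)    # ancillary comments (point 5)
    divisions: List[Division] = field(default_factory=list)

    def to_stream(self) -> str:
        """Flatten everything into one linear sequence of invented tags,
        which is what an interchange format ultimately must do (point 2)."""
        out = [f'<text title="{self.title}" encoder="{self.encoder}">']
        out += [f"<!-- {c} -->" for c in self.comments]
        for d in self.divisions:
            out.append(f'<div kind="{d.kind}" n="{d.number}" ana="{d.analysis}">')
            out.append(d.content)
            out += [f'<note target="{n.anchor}">{n.text}</note>' for n in d.notes]
            out.append("</div>")
        out.append("</text>")
        return "\n".join(out)


sample = EncodedText(title="A sample poem", encoder="R. Encoder (hypothetical)")
sample.comments.append("transcribed from a hypothetical 1890 printing")
sample.divisions.append(Division(
    kind="stanza", number=1,
    content="First line of the first stanza goes here,",
    analysis="a metrical analysis could go here",
    notes=[Note(anchor="here", text="an editorial note on this word")],
))
print(sample.to_stream())

Any real scheme will of course be far richer than this; the point is
only to show the six kinds of information side by side in one place.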
To create a single encoding scheme suitable for common use, the TEI
first formulated (in the original planning conference in 1987 and in
working papers since) the following requirements for the scheme to be
developed:
1. It should specify a common interchange format.
2. It should provide a set of recommendations for encoding new textu-
al materials.
3. It should document the existing major schemes and investigate the
feasibility of developing a metalanguage in which to describe
them.
4. It must be a set of guidelines, not a set of rigid requirements.
5. It must be extensible.
6. It should be device- and software-independent.
7. It should be language-independent.
8. It should be application-independent.
As design goals, it was specified that the guidelines should:
1. suffice to represent the textual features needed for research
2. be simple, clear, and concrete
3. be easy for researchers to use without special-purpose software
4. allow the rigorous definition and efficient processing of texts
5. provide for user-defined extensions
6. conform to existing and emergent standards
We can expatiate on these, if anyone isn't sure what we mean by
them, but we won't do so here.
The current draft, be it noted, does *not* solve all these problems
or wholly fulfill all of the design goals. It wasn't expected to --
some of the hard problems were intentionally saved for the second cycle.
Here is my personal checklist of where we stand with respect to the
goals listed above (which, as you can tell from the overlaps, were taken
from different documents).
* The current draft (version 1.0) does specify both an interchange
format and recommendations, though perhaps not as explicitly as one
might have expected. It may need to become more explicit in defin-
ing the interchange format.
* It does not document any existing encoding schemes, though work is
continuing on that topic.
* The metalanguage and syntax committee did consider the formulation
of a metalanguage for defining existing schemes, but decided against
it. Descriptions will take the form of prose and of algorithms for
translating from a given scheme into the TEI scheme, using a variety
of existing software tools (e.g. sed scripts, Rexx execs, Snobol
programs, or even yacc and lex code); a small sketch of such a
translation appears after this checklist.
* It is certainly a set of guidelines rather than requirements, and
device- and software-independent. It is also, however, not fully
implemented in software -- this has the advantage that the design is
not unduly biased by implementation issues, but it makes it hard to
demonstrate or validate the scheme.
* It is extensible, but the mechanisms for specifying extensions need
work to be usable without heavy-duty knowledge of SGML.
* It has no consciously introduced bias in favor of any one language,
but the TEI has not addressed, let alone solved, the problems of
languages other than those already most effectively covered by
international data-processing standards. The current draft
is silent on topics where people need the most guidance: older
forms of languages not covered by ISO standards, Asian scripts,
treatment of bidirectional text (e.g. Hebrew and English), and so
on. We expect to work on these in the next two years, but for some
issues there is little we can do but document and call attention to
existing methods of handling these problems (e.g. ISO 10646 or the
Unicode effort -- two unfortunately incompatible approaches to han-
dling Chinese and other Asian scripts).
* It does provide what we think is an adequate *basis* for handling
all the known needs of research; it probably needs extension in many
areas to provide not just the *basis* for the required solutions,
but some version of the solutions themselves.
* It's as simple and clear as we could make it, but we expect to hear
about lots of obscurities in the draft. (Let's say it again--please
let us know if there are things that aren't clear!)
* It can be used without special software, at least at the simpler
levels. A lot of work is needed, however, before we have something
we can hand to the average literary scholar who uses Nota Bene or
Word Perfect or Microsoft Word and wants to create a TEI-conformant
file. (Volunteer macro-writers sought!)
* So far, at least, the Guidelines can be used as specified in the ISO
standard which defines SGML. There are some technical reasons which
mean that the TEI guidelines may not be definable as a "conforming
application" of SGML -- these mostly relate to syntactic freedoms of
SGML which are forbidden by the current version of the Guidelines.
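One more aside, on the translation algorithms mentioned in the
checklist above: here is a minimal sketch of what such a translation
might look like, written in Python rather than sed or Snobol. The
legacy convention (a hypothetical scheme in which <A n> marks an act
and <S n> a scene) and the target tag names are both invented for the
example and are not the actual TEI tags.

import re

# Invented input: <A n> marks an act, <S n> a scene, and every other
# line is a line of text.
LEGACY_SAMPLE = (
    "<A 1>\n"
    "<S 2>\n"
    "First line of the scene.\n"
    "Second line of the scene.\n"
)


def translate(text: str) -> str:
    """Rewrite the legacy markers as explicit start-tags in an invented
    SGML-flavored target scheme, and wrap each plain line in <l>...</l>."""
    text = re.sub(r"<A (\d+)>", r'<act n="\1">', text)
    text = re.sub(r"<S (\d+)>", r'<scene n="\1">', text)
    out = []
    for line in text.splitlines():
        out.append(line if line.startswith("<") else f"<l>{line}</l>")
    return "\n".join(out)


print(translate(LEGACY_SAMPLE))

A sed script or a Snobol program doing the same substitutions would
serve just as well; the point is that most such translations are
largely mechanical.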
That's it for the basic goals of the TEI. Coming up: discussions of
SGML basics, the TEI tags for core structural features, other core tags
in the TEI scheme, and character-set issues. After that, we should be
able to raise some of the more advanced questions.
-Michael Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago